ABSTRACT

In the companion paper, we addressed low-power DCT/IDCT VLSI architectures based on the multirate
approach, whose hardware complexity increases linearly. In this paper, we discuss other aspects of
low-power design. First, we consider the design of low-power architectures that lower the power
consumption at only an O(log M) increase in hardware complexity. Next, we extend the low-power
DCT design to other orthogonal transforms such as the Modulated Lapped Transform (MLT) and the
Extended Lapped Transform (ELT). A unified programmable IIR low-power transform module, which
can perform most of the existing discrete sinusoidal transforms, is also proposed. Finally, we perform
a finite-precision analysis of the DCT architecture under normal and multirate operation. In
VLSI design, the assignment of the system wordlength directly affects the total switching events
and routing capacities, and hence the power consumption. Using the analytical results, we can choose
the optimal wordlength for each DCT channel under a required signal-to-noise ratio (SNR) constraint.
The material presented in this paper, together with the multirate architectures in the companion
paper, provides a framework for algorithm-based low-power transform coding kernel design.


This  work  was  supported  in  part  by  the  ONR  grant  N00014-93-10566  and  the  NSF  NYI  Award 
MIP9457397.

1  Introduction

In the companion paper [1], we introduced the algorithm-based low-power design based on the
multirate approach. Specifically, we showed that the power consumption can be reduced provided that we
can perform the DCT/IDCT from the decimated-by-M input sequences at an O(M) increase in hardware
complexity. In practice, the O(M) overhead may not be desirable when M is large and the total
chip area is limited. Therefore, a compensation scheme with less hardware overhead is
desired. In this paper, we show a scheme that performs the polyphase decomposition in such a way
that only O(log M) overhead is required to compensate for the speed penalty. The resulting structure
reduces the operating frequency on a stage-by-stage basis: in each stage, the operating frequency
is reduced by half. After the (log M)th stage, we can operate at a clock rate M times slower
than the original data rate. We shall refer to this as the logarithmic low-power architecture. This
multiple-operating-frequency environment allows us to perform different speed compensation at each
stage; i.e., different low supply voltages can be used to lower the power consumption. In general, the
power savings of the logarithmic architecture lie between those of the normal IIR architecture [2] and
the full multirate architecture presented in the companion paper [1].

Next  we  extend  the  low-power  design  presented  in  the  companion  paper  to  a  larger  class  of 
orthogonal  transforms.  We  start  with  the  low-power  design  of  the  Modulated  Lapped  Transform 
(MLT)  and  Extended  Lapped  Transform  (ELT).  The  MLT  and  ELT,  which  belong  to  the  family  of 
Lapped  Orthogonal  Transforms  (LOT),  are  very  attractive  in  the  applications  of  transform  coding 
since  they  can  diminish  the  blocking  effect  encountered  in  low  bit-rate  block  transforms  [3]  [4]  [5]. 
Recently,  Frantzeskakis  et  al.  [6]  [7]  proposed  the  time-recursive  MLT  and  ELT  architectures  that  are 
suitable  for  VLSI  implementation  due  to  their  modularity  and  regularity.  However,  since  the  updating 
of  the  MLT  and  ELT  coefficients  should  be  as  fast  as  the  input  data  rate,  those  architectures  cannot 
compensate  the  speed  penalty  under  low  supply  voltage.  In  this  paper,  we  will  derive  the  low-power 
time-recursive MLT and ELT structures. By applying the polyphase decomposition to their IIR
transfer functions, the MLT/ELT coefficients can be updated at an M-times slower rate with linear
hardware overhead; hence, low-power operation becomes possible. Later, based on the derivations of
the  MLT  and  ELT,  we  propose  a  unified  low-power  IIR  structure  which  can  be  implemented  as  a 
programmable  DSP  co-processor  to  perform  most  of  the  existing  sinusoidal  transforms. 

In the last part of this paper, we consider the finite-wordlength effect of the proposed
low-power DCT architectures. The effect of wordlength on the power consumption was discussed in [8].
In summary, shorter wordlengths result in fewer switching events, lower capacitance, and shorter
average routing length in the system. As a result, low power consumption of the chip can be achieved.
On the other hand, if the wordlengths are too short, the rounding error caused by finite-precision
operations can be severe enough to jeopardize the signal-to-noise ratio (SNR). Thus, choosing the minimum
wordlengths that still meet the SNR requirement is an important issue in low-power VLSI
design. Motivated by this, we perform the finite-precision analysis for the IIR DCT structure and its
low-power design. Our study can precisely predict the finite-precision behavior under different block
sizes and decimation factors. Using these analytical results, we can assign the optimal wordlength for
each DCT channel given the SNR constraint. Moreover, our analyses show that the average SNRs
of the proposed low-power architectures are better than that of the normal design given the same
wordlength assignment. This indicates that the multirate design has better numerical properties under
fixed-point arithmetic.

The organization of this paper is as follows: Section 2 presents the low-power DCT architecture
of logarithmic complexity. In Section 3, we derive the multirate MLT and ELT algorithms and
architectures. Then, a unified low-power IIR structure for most sinusoidal transforms is described.
The fixed-point analysis is presented in Section 4, followed by a conclusion.

2  Low-Power  Architecture  of  Logarithmic  Complexity 

In the companion paper, we have shown how to perform the DCT/IDCT from the decimated-by-M
input sequences so that the speed penalty under low-power operation can be compensated at the
algorithmic/architectural level. The advantage of this design is obtained by applying the polyphase
decomposition to the IIR transfer function until the resulting transfer function is fully expanded
with all exponents being multiples of M. However, such manipulation requires O(M) overhead in
hardware, which may not be acceptable when M is large and the chip area is limited. In this section,
we show how to achieve low power consumption with only logarithmic complexity overhead. The
basic principle is to repeat the polyphase decomposition in a cascaded way instead of fully expanding
the transfer function. By doing so, lower-rate operation is obtained while the complexity grows more slowly.
The price paid is that the resulting architecture operates at multiple low frequencies rather than
at the uniform low frequency discussed in [1]. Nevertheless, the multiple-frequency environment
enables us to perform different speed compensations at different stages of the design. Therefore,
different low supply voltages can be applied according to the given speed constraint, and the total
power consumption can still be reduced. In what follows, we derive the logarithmic low-power
DCT architecture. The results can be extended to the other low-power transform designs to be
discussed in Section 3.

2.1  Low-Power  DCT  Architecture  of  Logarithmic  Complexity 

The multirate IIR DCT transfer function with M = 2 can be written as [1]

$$H_{DCT,k}(z) = (-1)^k C(k)\,\frac{H_0(z^2) + z^{-1} H_1(z^2)}{D(z^2)}, \tag{1}$$

where C(k) is the scaling factor of the DCT and

$$D(z^2) = 1 - 2\cos(4\omega_k)\,z^{-2} + z^{-4},$$

$$H_0(z^2) = \cos\omega_k - \cos(3\omega_k)\,z^{-2},$$

$$H_1(z^2) = \cos(3\omega_k) - \cos\omega_k\,z^{-2}. \tag{2}$$

Substituting the polyphase decomposition

$$\frac{1}{D(z^2)} = \frac{H'_0(z^4) + z^{-2} H'_1(z^4)}{D'(z^4)}, \tag{3}$$

$$D'(z^4) = 1 - 2\cos(8\omega_k)\,z^{-4} + z^{-8}, \qquad H'_0(z^4) = 1 + z^{-4}, \qquad H'_1(z^4) = 2\cos(4\omega_k), \tag{4}$$

into (1) and rearranging, we can rewrite $H_{DCT,k}(z)$ so that the DCT can be computed at a four times
slower clock rate [1]. Nevertheless, this multirate design requires O(M) hardware overhead to directly
lower the input clock rate by four. To reduce the hardware complexity, we may instead rewrite (1) in
a cascade form after the substitution is made, i.e.,

$$H_{DCT,k}(z) = (-1)^k C(k)\,\left[H_0(z^2) + z^{-1} H_1(z^2)\right]\cdot\frac{H'_0(z^4) + z^{-2} H'_1(z^4)}{D'(z^4)}. \tag{5}$$

Fig. 1(a) shows the polyphase implementation of (5), which leads to the cascade multirate DCT
architecture depicted in Fig. 1(b). There are two major blocks. One operates at 50% of the sample rate and
the other at 25% of the sample rate. Due to the special form of the denominator of the transfer function,
we can repeatedly perform the polyphase decomposition on the denominator and retain the cascade
form. We then have

$$H_{DCT,k}(z) = (-1)^k C(k)\,\left[H_0(z^2) + z^{-1} H_1(z^2)\right]\cdot\frac{\prod_{i=1}^{\log M - 1}\left[\left(1 + z^{-2^{i+1}}\right) + 2 z^{-2^{i}}\cos\left(2^{i+1}\omega_k\right)\right]}{1 - 2\cos(2M\omega_k)\,z^{-M} + z^{-2M}} \tag{6}$$

for any $M \in 2^{Z^+}$, i.e., any M that is a power of two. The resulting architecture decimates the
operating frequency on a stage-by-stage basis: in each stage, the operating frequency is reduced by half.
After the (log M)th stage, the clock rate is M times slower than the original data rate.
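The cascade form can be checked numerically. The following is a minimal sketch, not part of the original derivation: it multiplies out the numerator and denominator of (6) as polynomials in z^-1 and compares the result with the single-rate transfer function. The single-rate form cos(w)(1 - z^-1) / (1 - 2cos(2w) z^-1 + z^-2) is an assumption here, obtained by undoing the M = 2 decomposition in (1)-(2); the gain (-1)^k C(k) is omitted, and all function names are hypothetical.

```python
import math

def polymul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def cascade_form(w, M):
    """Numerator and denominator of (6), without the (-1)^k C(k) gain."""
    # First stage: H0(z^2) + z^-1 H1(z^2), written out in powers of z^-1.
    num = [math.cos(w), math.cos(3 * w), -math.cos(3 * w), -math.cos(w)]
    d = 2  # d = 2^i for the i-th cascaded factor
    while d < M:
        factor = [0.0] * (2 * d + 1)
        factor[0] = factor[2 * d] = 1.0
        factor[d] = 2.0 * math.cos(2 * d * w)
        num = polymul(num, factor)
        d *= 2
    den = [0.0] * (2 * M + 1)
    den[0] = den[2 * M] = 1.0
    den[M] = -2.0 * math.cos(2 * M * w)
    return num, den

def max_mismatch(w, M):
    """Cross-multiply (6) against the assumed single-rate form; 0 means equal."""
    num, den = cascade_form(w, M)
    lhs = polymul([math.cos(w), -math.cos(w)], den)
    rhs = polymul(num, [1.0, -2.0 * math.cos(2 * w), 1.0])
    return max(abs(a - b) for a, b in zip(lhs, rhs))
```

Each pass through the loop adds one second-order factor, so the number of cascaded stages grows as log M rather than M.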


2.2  Power  Consumption 

When low-power implementation is taken into consideration, the multiple operating frequencies
of the above architecture imply that different supply voltages can be used according to the
slowest allowable operating speed of each stage. That is, the operators that realize $H_0(z^2)$ and $H_1(z^2)$ in (5) can be
operated at 3.1 V due to the two times slower clock rate, while all other operators, which realize $H'_0(z^4)$
and $H'_1(z^4)$, can be operated at 2.1 V due to the four times slower clock rate [9][1]. As a consequence,
the power consumption of the 16-point low-power DCT architecture in Fig. 1(b) can be estimated as


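$$P \approx \left[\frac{N_2}{N_0}\left(\frac{3.1\,\mathrm{V}}{5\,\mathrm{V}}\right)^2\frac{1}{2} + \frac{N_4}{N_0}\left(\frac{2.1\,\mathrm{V}}{5\,\mathrm{V}}\right)^2\frac{1}{4}\right] P_0 \approx 0.24\,P_0, \tag{7}$$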


where $N_0$ = 30 is the total number of multipliers used in the normal DCT (M = 0); $N_2$ = 30 and $N_4$ = 30 are
the numbers of multipliers in the M = 2 stage and the M = 4 stage, respectively. From (7), we can see
that the overall power consumption of the logarithmic low-power design lies between those of the M = 2
and M = 4 full multirate DCT systems discussed in [1].
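The arithmetic behind the 0.24 figure can be reproduced with a short sketch. It assumes the usual CMOS model in which power scales with (hardware) x (supply voltage)^2 x (clock rate), and it assumes a 5 V supply for the normal design; the 5 V reference is not stated in this section and is an assumption, as is the function name.

```python
def relative_power(stages, v_ref=5.0):
    """Sum of (hardware ratio) * (voltage ratio)^2 * (clock ratio) over stages.

    v_ref is the assumed supply voltage of the normal (full-rate) design.
    """
    return sum(n_ratio * (v / v_ref) ** 2 * f_ratio
               for n_ratio, v, f_ratio in stages)

# M = 2 stage: N2/N0 = 30/30, 3.1 V, half clock rate.
# M = 4 stage: N4/N0 = 30/30, 2.1 V, quarter clock rate.
log_design = relative_power([(30 / 30, 3.1, 1 / 2), (30 / 30, 2.1, 1 / 4)])
print(round(log_design, 2))  # prints 0.24
```

Under the same assumptions, a full multirate design with doubled hardware at half rate gives `relative_power([(2, 3.1, 1 / 2)])`, about 0.38, which bounds the logarithmic design from above.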

On the other hand, by examining (6), we can see that in order to have an M-times slower operating
frequency at the final stage, we need a total of (log M + 2) multipliers to realize the multirate transfer
function. The comparison of the logarithmic low-power architecture with other approaches is listed
in Table 1. Although the total power savings of the logarithmic structure are less than those of the
full multirate structure given the same decimation factor M, the O(log M) hardware overhead is
preferable when we want to achieve low power consumption without trading away too much chip area.

The multiple-frequency feature of the cascade low-power architecture also allows us to achieve
more power and area savings at the arithmetic level. For example, we can use look-ahead adders in
the M = 2 region to match the data throughput rate, whereas we can employ low-speed carry-ripple
adders in the M = 4 region due to the much relaxed speed constraint.

3  Unified  Low-Power  Module  Design 

3.1  The  IIR  MLT  Algorithm 

The MLT operates on segments of data of length 2N, x(t + n - 2N + 1), n = 0, 1, ..., 2N - 1, and
produces N output coefficients, $X_{MLT,k}(t)$, k = 0, 1, ..., N - 1, as follows [4]:

$$X_{MLT,k}(t) = S(k)\sqrt{\frac{2}{N}}\sum_{n=0}^{2N-1}\sin\left[\frac{\pi}{2N}\left(n+\frac{1}{2}\right)\right]\cos\left[\frac{\pi}{N}\left(k+\frac{1}{2}\right)\left(n+\frac{1}{2}+\frac{N}{2}\right)\right]x(t+n-2N+1), \tag{8}$$

where $S(k) = (-1)^{(k+2)/2}$ if k is even, and $S(k) = (-1)^{(k-1)/2}$ if k is odd. After some algebraic
manipulations, the MLT can be decomposed into [7]

$$X_{MLT,k}(t) = -S(k)\left[X_{C,k+1}(t) + X_{S,k}(t)\right], \tag{9}$$

where

$$X_{C,k}(t) = \beta_1\sum_{n=0}^{L-1}\cos\left[(2n+1)\omega_k + \theta_k\right]x(t+n-2N+1), \tag{10}$$

$$X_{S,k}(t) = \beta_1\sum_{n=0}^{L-1}\sin\left[(2n+1)\omega_k + \theta_k\right]x(t+n-2N+1), \tag{11}$$

with block size L = 2N and

$$\beta_1 = \frac{1}{\sqrt{2N}}, \qquad \omega_k = \frac{k\pi}{2N}, \qquad \theta_k = \frac{\pi}{2}\left(k+\frac{1}{2}\right). \tag{12}$$

The IIR transfer functions for (10) and (11) can be computed as


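$$H_{C,k}(z) = \beta_1\cdot\frac{\cos((2L-1)\omega_k+\theta_k) - \cos((2L+1)\omega_k+\theta_k)\,z^{-1}}{1 - 2\cos(2\omega_k)\,z^{-1} + z^{-2}}, \tag{13}$$

$$H_{S,k}(z) = \beta_1\cdot\frac{\sin((2L-1)\omega_k+\theta_k) - \sin((2L+1)\omega_k+\theta_k)\,z^{-1}}{1 - 2\cos(2\omega_k)\,z^{-1} + z^{-2}}. \tag{14}$$

The corresponding IIR module for the dual generation of $X_{C,k}(t)$ and $X_{S,k}(t)$ is depicted in Fig. 2,
where

$$\Gamma_1 = \beta_1\cos((2L-1)\omega_k+\theta_k), \qquad \Gamma_2 = -\beta_1\cos((2L+1)\omega_k+\theta_k),$$

$$\Gamma_3 = \beta_1\sin((2L-1)\omega_k+\theta_k), \qquad \Gamma_4 = -\beta_1\sin((2L+1)\omega_k+\theta_k). \tag{15}$$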

This IIR module can be used as a basic building block to implement the MLT according to (9). Fig. 3
illustrates the overall time-recursive MLT architecture for the case N = 8. It consists of two parts:
One is the IIR module array, which computes $X_{C,k}(t)$ and $X_{S,k}(t)$ for different indices k in parallel.
The other is the combination circuit, which selects and combines the outputs of the IIR array to
generate the MLT coefficients.
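To illustrate how one IIR module produces both sums, the sketch below runs a second-order recursion with the transfer functions (13) and (14) over one block, starting from zero state, and compares the final outputs with the direct sums (10) and (11). The shared-state (direct form II style) arrangement is chosen here for brevity and is an assumption, not necessarily the wiring of Fig. 2; the parameter values used to exercise it are arbitrary.

```python
import math

def direct_sums(x, beta, w, theta):
    """Block sums (10) and (11) over one block x[0..L-1]."""
    xc = beta * sum(math.cos((2 * n + 1) * w + theta) * v for n, v in enumerate(x))
    xs = beta * sum(math.sin((2 * n + 1) * w + theta) * v for n, v in enumerate(x))
    return xc, xs

def iir_module(x, beta, w, theta):
    """Run the recursion of (13)-(14) from zero state over one block."""
    L = len(x)
    g1 = beta * math.cos((2 * L - 1) * w + theta)
    g2 = -beta * math.cos((2 * L + 1) * w + theta)
    g3 = beta * math.sin((2 * L - 1) * w + theta)
    g4 = -beta * math.sin((2 * L + 1) * w + theta)
    a = 2.0 * math.cos(2 * w)   # loop multiplier
    s1 = s2 = 0.0               # recursive state s(t-1), s(t-2)
    yc = ys = 0.0
    for v in x:
        s0 = v + a * s1 - s2    # 1 / (1 - 2cos(2w) z^-1 + z^-2)
        yc = g1 * s0 + g2 * s1  # cosine output, numerator of (13)
        ys = g3 * s0 + g4 * s1  # sine output, numerator of (14)
        s1, s2 = s0, s1
    return yc, ys
```

After L input samples the two outputs equal the block sums, which is why one loop multiplier and four output multipliers suffice per index k in this arrangement.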

3.2  Low-Power  Design  of  the  MLT 

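As with the low-power DCT, we can obtain a low-power MLT architecture if each MLT module can
compute $X_{C,k}(t)$ and $X_{S,k}(t)$ using the decimated input sequences. After performing the polyphase
decomposition on (13) and (14), we can compute the multirate IIR transfer functions for $H_{C,k}(z)$
and $H_{S,k}(z)$ as

$$H_{C,k}(z) = \frac{\beta_1}{1 - 2\cos(4\omega_k)\,z^{-1} + z^{-2}}\Big(\left[\cos((2L-3)\omega_k+\theta_k) - \cos((2L+1)\omega_k+\theta_k)\,z^{-1}\right]X_e(z)$$
$$\qquad\qquad + \left[\cos((2L-1)\omega_k+\theta_k) - \cos((2L+3)\omega_k+\theta_k)\,z^{-1}\right]X_o(z)\Big), \tag{16}$$

and

$$H_{S,k}(z) = \frac{\beta_1}{1 - 2\cos(4\omega_k)\,z^{-1} + z^{-2}}\Big(\left[\sin((2L-3)\omega_k+\theta_k) - \sin((2L+1)\omega_k+\theta_k)\,z^{-1}\right]X_e(z)$$
$$\qquad\qquad + \left[\sin((2L-1)\omega_k+\theta_k) - \sin((2L+3)\omega_k+\theta_k)\,z^{-1}\right]X_o(z)\Big). \tag{17}$$

The parallel architecture for (16) and (17) is shown in Fig. 4, where

$$\Gamma_{1,e} = \beta_1\cos((2L-3)\omega_k+\theta_k), \qquad \Gamma_{2,e} = -\beta_1\cos((2L+1)\omega_k+\theta_k),$$
$$\Gamma_{3,e} = \beta_1\sin((2L-3)\omega_k+\theta_k), \qquad \Gamma_{4,e} = -\beta_1\sin((2L+1)\omega_k+\theta_k),$$
$$\Gamma_{1,o} = \beta_1\cos((2L-1)\omega_k+\theta_k), \qquad \Gamma_{2,o} = -\beta_1\cos((2L+3)\omega_k+\theta_k),$$
$$\Gamma_{3,o} = \beta_1\sin((2L-1)\omega_k+\theta_k), \qquad \Gamma_{4,o} = -\beta_1\sin((2L+3)\omega_k+\theta_k). \tag{18}$$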


It consists of two MLT modules of the type shown in Fig. 2. The upper module computes part of $X_{C,k}(t)$ and
$X_{S,k}(t)$ from the even sequence, while the lower one computes the remaining part from the odd
sequence. The two adders at the right end are used to combine the even and odd outputs. Through
such manipulation, only decimated sequences are processed inside the module. Hence, the MLT
module can operate at half of the original frequency at the cost of doubled hardware complexity. The
comparison of hardware cost is shown in Table 2. Suppose that $P_0$ denotes the power consumption of
the MLT module in Fig. 2. From the CMOS power model, it can be shown that the power consumption
of the low-power MLT modules is 0.38$P_0$ and 0.17$P_0$ for the cases M = 2 and M = 4, respectively.
Basically, these savings are obtained at the expense of a linear increase in hardware.
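The even/odd split for M = 2 can be sketched as follows. Each branch runs the half-rate recursion with loop multiplier 2cos(4w) on its own decimated sequence, and the two branch outputs are added at the end of the block. The cosine output is shown; the sine output follows by replacing cos with sin. Taking the even sequence as x[0], x[2], ... and the odd sequence as x[1], x[3], ... within a block of even length is an assumption about the indexing convention, and the function names are hypothetical.

```python
import math

def half_rate_branch(seq, g_a, g_b, a):
    """(g_a + g_b z^-1) / (1 - a z^-1 + z^-2), clocked once per two input samples."""
    s1 = s2 = y = 0.0
    for v in seq:
        s0 = v + a * s1 - s2
        y = g_a * s0 + g_b * s1
        s1, s2 = s0, s1
    return y

def multirate_xc(x, beta, w, theta):
    """Cosine block sum for one block via the even/odd decomposition."""
    L = len(x)  # assumed even
    a = 2.0 * math.cos(4 * w)

    def c(m):
        return beta * math.cos((2 * L + m) * w + theta)

    even = half_rate_branch(x[0::2], c(-3), -c(1), a)  # coefficients with (2L-3), (2L+1)
    odd = half_rate_branch(x[1::2], c(-1), -c(3), a)   # coefficients with (2L-1), (2L+3)
    return even + odd

def direct_xc(x, beta, w, theta):
    """Direct cosine block sum for comparison."""
    return beta * sum(math.cos((2 * n + 1) * w + theta) * v for n, v in enumerate(x))
```

Both branches take L/2 steps per block, so every multiplier and adder inside them can run at half the input rate, at the cost of two modules instead of one.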


3.3  Low-Power  Design  of  the  ELT 

The ELT with basis length equal to 4N operates on data segments of length 4N, x(t + n - 4N + 1), n =
0, 1, ..., 4N - 1, and produces N output coefficients, $X_{ELT,k}(t)$, k = 0, 1, ..., N - 1. One good choice
for the ELT is as follows [10][5]:

$$X_{ELT,k}(t) = \sqrt{\frac{2}{N}}\sum_{n=0}^{4N-1}\left[\frac{1}{2\sqrt{2}} - \frac{1}{2}\cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\right)\right]\cos\left[\frac{\pi}{N}\left(k+\frac{1}{2}\right)\left(n+\frac{1}{2}+\frac{N}{2}\right)\right]x(t+n-4N+1). \tag{19}$$

By the use of some trigonometric identities, we can rewrite (19) as

$$X_{ELT,k}(t) = -X_{S,k+1}(t) + \sqrt{2}\,X_{C,k}(t) + X_{S,k-1}(t), \tag{20}$$

where

$$X_{C,k}(t) = \beta_2\sum_{n=0}^{L-1}\cos\left[(2n+1)\omega'_k + \theta'_k\right]x(t+n-4N+1), \tag{21}$$

$$X_{S,k}(t) = \beta_2\sum_{n=0}^{L-1}\sin\left[(2n+1)\omega'_k + \theta'_k\right]x(t+n-4N+1), \tag{22}$$

with

$$L = 4N, \qquad \beta_2 = \frac{1}{2\sqrt{2N}}, \qquad \omega'_k = \frac{\pi}{2N}\left(k+\frac{1}{2}\right), \qquad \text{and} \qquad \theta'_k = \frac{\pi}{2}\left(k+\frac{1}{2}\right). \tag{23}$$


We refer to the relationships in (9) and (20) as the combination functions. Comparing (9)-(12) with
(20)-(23), we see that the MLT and ELT have identical mathematical structures except for the
definitions of the parameters and the combination functions. Therefore, the IIR MLT module in Fig. 2,
as well as the low-power MLT module in Fig. 4, can be readily applied to the ELT by simply modifying
the multiplier coefficients. Also, the overall ELT architecture is similar to the MLT architecture in
Fig. 3 except that the combination circuit operates according to (20).

Moreover, it can be verified that $X_{S,-1}(t) = -X_{S,0}(t)$ and $X_{S,N}(t) = X_{S,N-1}(t)$. Hence, we can
compute the 0th and (N - 1)th ELT coefficients from

$$X_{ELT,0}(t) = -X_{S,1}(t) + \sqrt{2}\,X_{C,0}(t) - X_{S,0}(t),$$

$$X_{ELT,N-1}(t) = -X_{S,N-1}(t) + \sqrt{2}\,X_{C,N-1}(t) + X_{S,N-2}(t), \tag{24}$$

instead of implementing two extra ELT modules for $X_{S,-1}(t)$ and $X_{S,N}(t)$. The hardware cost for
the ELT can be found in Table 2. Since the number of multipliers of the ELT is about the same as
that of the MLT, the power savings for both transforms are similar.
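The ELT decomposition can be checked numerically. The sketch below evaluates the ELT directly from its defining sum and again through the combination function (20), and also checks the two boundary relations used in (24). The window is read here as 1/(2 sqrt 2) - (1/2) cos[pi (n + 1/2) / N], with beta_2 = 1/(2 sqrt(2N)), omega'_k = pi (k + 1/2) / (2N) and theta'_k = pi (k + 1/2) / 2; these readings are assumptions, chosen so that (19) through (23) agree with each other. The time index is fixed so that the block is simply x[0..4N-1].

```python
import math

def elt_direct(x, k, N):
    """Direct evaluation of the ELT sum for one block of length 4N."""
    total = 0.0
    for n in range(4 * N):
        window = 1 / (2 * math.sqrt(2)) - 0.5 * math.cos(math.pi / N * (n + 0.5))
        basis = math.cos(math.pi / N * (k + 0.5) * (n + 0.5 + N / 2))
        total += window * basis * x[n]
    return math.sqrt(2 / N) * total

def phase(n, k, N):
    """(2n+1) omega'_k + theta'_k under the assumed parameter reading."""
    return (2 * n + 1) * math.pi / (2 * N) * (k + 0.5) + math.pi / 2 * (k + 0.5)

def x_c(x, k, N):
    beta2 = 1 / (2 * math.sqrt(2 * N))
    return beta2 * sum(math.cos(phase(n, k, N)) * v for n, v in enumerate(x))

def x_s(x, k, N):
    beta2 = 1 / (2 * math.sqrt(2 * N))
    return beta2 * sum(math.sin(phase(n, k, N)) * v for n, v in enumerate(x))

def elt_combined(x, k, N):
    """Combination function (20)."""
    return -x_s(x, k + 1, N) + math.sqrt(2) * x_c(x, k, N) + x_s(x, k - 1, N)
```

Working through the phases shows that the second boundary relation carries a factor (-1)^N, so the check below uses an even N.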

3.4  Unified  Low-Power  IIR  Transform  Module  Design 

From the transform functions described in (9)-(12) and (20)-(23), we observe that the low-power MLT
module in Fig. 4 can be used to realize most existing discrete sinusoidal transforms by suitably setting
the parameters and defining the combination functions. For example, $X_{C,k}(t)$ in (10) is equivalent
to the DCT by setting

$$L = N, \qquad \beta_1 = C(k), \qquad \omega_k = \frac{k\pi}{2N}, \qquad \text{and} \qquad \theta_k = 0. \tag{25}$$

As a result, the multirate MLT module in Fig. 4 can compute the DCT for different indices k in
parallel.

Another example is the discrete Fourier transform (DFT) with real-valued inputs. With the
following parameter setting

$$L = N, \qquad \beta_1 = \frac{1}{\sqrt{N}}, \qquad \omega_k = \frac{-k\pi}{N}, \qquad \text{and} \qquad \theta_k = -\omega_k, \tag{26}$$

(10) and (11) become

$$X_{C,k}(t) = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}\cos\left(\frac{-2\pi}{N}kn\right)x(t+n-N+1), \tag{27}$$

$$X_{S,k}(t) = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}\sin\left(\frac{-2\pi}{N}kn\right)x(t+n-N+1), \tag{28}$$

which are the real part and the imaginary part of the DFT, respectively. The discrete Hartley
transform (DHT) can be computed using the same parameter setting as the DFT except that the
combination circuit in Fig. 3 performs

$$X_{DHT,k}(t) = X_{C,k}(t) + X_{S,k}(t). \tag{29}$$

The parameter settings as well as the corresponding combination functions for other orthogonal
transforms are summarized in Table 3.
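The programmable use of a single module can be sketched in software as one routine that evaluates the cosine and sine sums, plus small parameter functions for each transform. The parameter values below are my reading of (25) and (26): omega_k = k pi / (2N) for the DCT, and omega_k = -k pi / N with theta_k = -omega_k for the DFT. All function names are hypothetical, and `c_k` stands for the DCT scaling factor C(k), whose value is not given in this section.

```python
import cmath
import math

def module_outputs(x, beta, w, theta):
    """Cosine and sine block sums of (10)-(11) for one block x[0..L-1]."""
    xc = beta * sum(math.cos((2 * n + 1) * w + theta) * v for n, v in enumerate(x))
    xs = beta * sum(math.sin((2 * n + 1) * w + theta) * v for n, v in enumerate(x))
    return xc, xs

def dct_params(k, N, c_k):
    """DCT setting; c_k is supplied by the caller."""
    return c_k, k * math.pi / (2 * N), 0.0

def dft_params(k, N):
    """DFT setting for real-valued input."""
    w = -k * math.pi / N
    return 1 / math.sqrt(N), w, -w

def dft_bin(x, k):
    """Real and imaginary parts come from the cosine and sine outputs."""
    xc, xs = module_outputs(x, *dft_params(k, len(x)))
    return complex(xc, xs)

def dht_bin(x, k):
    """Same parameters as the DFT, combined as in (29)."""
    xc, xs = module_outputs(x, *dft_params(k, len(x)))
    return xc + xs

def dft_reference(x, k):
    """Textbook DFT with 1/sqrt(N) scaling, used only as a check."""
    N = len(x)
    return sum(v * cmath.exp(-2j * math.pi * k * n / N) for n, v in enumerate(x)) / math.sqrt(N)
```

Only the three parameters and the combination step change between transforms; the summation kernel is shared.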

The programmable feature of the unified low-power module design makes it very attractive
in transform coding applications. First, the unified structure can be implemented as a
high-performance programmable co-processor which performs various transforms for the host processor
by loading the suitable parameters. Second, by hard-wiring the multiplier coefficients of the
modules to preset values according to the transformation type, we can perform any one of the discrete
sinusoidal transforms using the same architecture. This can significantly reduce the design cycle as
well as the manufacturing cost.

3.5  Extension  to  Low-Power  2-D  Transforms 

Extension of our low-power transform algorithms to low-power two-dimensional (2-D) transforms can
be achieved by employing the time-recursive 2-D DCT architecture proposed by Chiu and Liu [11]. In
general, the architecture in [11] can be applied to all transformations with the SIPO property. Therefore,
we can apply it to implement the low-power 2-D transforms with some minor modifications.

4  Finite-Precision Analysis of the IIR DCT Architecture

In low-power VLSI implementation, the choice of wordlength is an important issue since it directly
affects the total switching activity inside the operators as well as the total effective capacitance.
Conversely, an underestimated wordlength will degrade the system performance due to the increased
rounding errors. Therefore, we should carefully determine the minimum allowable system wordlength
that meets the accuracy criteria for a cost-effective implementation. In this section, we consider
the finite-precision effects of the proposed low-power DCT architectures. The results can be easily
extended to other transform architectures. We start with the DCT architecture under normal
operation; the analysis is then extended to the low-power design with M = 2. The general results
for arbitrary M are also presented. Throughout the derivations, the "statistical error model" for
fixed-point analysis is used [12, chap. 6]:

1. The rounding error is treated as wide-sense stationary additive white noise with magnitude uniformly
distributed over one quantization level.

2.  Rounding  error  occurs  only  in  multiplication. 

3.  All  errors  are  uncorrelated  with  the  input  signal,  and  are  independent  of  each  other. 

4.1  Basic  Considerations  in  Finite-Precision  Analysis 

There are two basic considerations in the fixed-point analysis. One is the rounding error behavior.
Rounding error occurs when we multiply two (B + 1)-bit numbers together while only a (B + 1)-bit product is kept.
The mean and variance of the rounding error are given by [12, chap. 6]

$$m_R = 0, \qquad \sigma_R^2 = \frac{2^{-2B}}{12}, \tag{30}$$

respectively. Understanding the rounding error behavior allows us to minimize the wordlength needed to
achieve a desired output signal-to-noise ratio.
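The rounding-error statistics can be illustrated with a small Monte Carlo sketch. It rounds products of two random fractions to B fractional bits and estimates the mean and variance of the error, which the statistical model predicts to be 0 and 2^(-2B)/12. Drawing the operands uniformly from (-1, 1), the trial count, and the fixed seed are all arbitrary choices made for this illustration.

```python
import random

def rounding_error_stats(B, trials=100_000, seed=0):
    """Empirical mean and variance of the error from rounding to B fractional bits."""
    rng = random.Random(seed)
    scale = 1 << B
    errors = []
    for _ in range(trials):
        product = rng.uniform(-1, 1) * rng.uniform(-1, 1)
        rounded = round(product * scale) / scale  # keep B fractional bits
        errors.append(rounded - product)
    mean = sum(errors) / trials
    var = sum((e - mean) ** 2 for e in errors) / trials
    return mean, var

def model_variance(B):
    """Variance predicted by the statistical error model."""
    return 2.0 ** (-2 * B) / 12
```

Each extra bit in B lowers the model variance by a factor of four, i.e., about 6 dB per bit.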

The other is the dynamic range issue. In fixed-point implementation, each number in the system
is treated as a fraction. The magnitude at each node in the circuit cannot exceed one; otherwise
overflow occurs and results in great distortion in the final output. Therefore, to prevent overflow,
a suitable scaling of the input signal is usually employed according to the dynamic range of the
system. In practice, the signal-to-noise ratio of the scaled system, SNR', will be degraded by the
scaling process and is given by [12, chap. 6]

$$SNR' = s^2\,SNR_0, \tag{31}$$

where s is the scaling factor and $SNR_0$ is the signal-to-noise ratio of the original system. This implies
that knowing the dynamic range enables us to perform only the minimally necessary scaling and so avoid
further degradation in SNR.


4.2  IIR  DCT  Using  Direct  Form  I  Structure 
4.2.1  Rounding  Errors 

Using the statistical error model, the rounding error of the IIR DCT structure can be modeled as
(see Fig. 5)

$$e(t) = e_1(t) + e_2(t), \tag{32}$$

where $e_i(t)$, i = 1, 2, is the rounding error caused by the ith multiplier in the circuit.¹ Then the
actual output of the DCT circuit after N iterations can be represented as

$$\hat{X}_{DCT,k}(t) = X_{DCT,k}(t) + f(t), \tag{33}$$

where f(t) is the output error due to the noise e(t).

¹ Since the 0th channel of the DCT is computed by a simple add-and-accumulate operation, we will not consider the
finite-wordlength effect of this channel.

Let $H_{ef}(z)$ denote the transfer function of the system from the node at which e(t) is injected to
the output, and $h_{ef}(n)$ be the corresponding unit-sample response. From Fig. 5, $H_{ef}(z)$ is given by

$$H_{ef}(z) = \frac{1}{1 - 2\cos(2\omega_k)\,z^{-1} + z^{-2}}, \tag{34}$$

and $h_{ef}(n)$ can be derived as

$$h_{ef}(n) = \frac{1}{\sin(2\omega_k)}\sin\left[(n+1)\,2\omega_k\right]u(n), \tag{35}$$

where u(n) denotes the step function. Since only N iterations are performed in the IIR circuit, the
mean and variance of f(t) of the kth DCT channel can be computed as

$$m_f = m_e\sum_{n=0}^{N-1}h_{ef}(n) = m_e\,\frac{1}{\sin(2\omega_k)}\sum_{n=0}^{N-1}\sin\left[(n+1)\,2\omega_k\right], \tag{36}$$

$$\sigma_f^2 = \sigma_e^2\sum_{n=0}^{N-1}\left|h_{ef}(n)\right|^2 = \frac{\sigma_e^2}{\sin^2(2\omega_k)}\sum_{n=0}^{N-1}\sin^2\left[(n+1)\,2\omega_k\right] = \frac{\sigma_e^2}{\sin^2(2\omega_k)}\cdot\frac{N}{2}, \tag{37}$$

where

$$m_e = E\{e(t)\} = 0, \tag{38}$$

$$\sigma_e^2 = E\{e^2(t)\} = E\{(e_1(t))^2\} + E\{(e_2(t))^2\} = (1 + N_s(k))\,\sigma_R^2, \tag{39}$$

and $N_s(k)$ is the number of noise sources contributed by the multiplier $M_2 = 2\cos(2\omega_k)$ in the
IIR loop:

$$N_s(k) = \begin{cases} 4, & \text{if } |2\cos(2\omega_k)| > 1, \\ 1, & \text{if } |2\cos(2\omega_k)| < 1, \\ 0, & \text{if } |2\cos(2\omega_k)| = 1. \end{cases} \tag{40}$$

When $|2\cos(2\omega_k)| < 1$, a normal multiplication is performed and $E\{(e_2(t))^2\} = \sigma_R^2$. In the case of
$|2\cos(2\omega_k)| > 1$, since a left-shift is performed after the multiplication with $\cos(2\omega_k)$, the rounding
error is amplified by 2 and its power becomes $E\{(2e_2(t))^2\} = 4\sigma_R^2$. In the case of $|2\cos(2\omega_k)| = 1$,
no multiplication is performed, hence $E\{(e_2(t))^2\} = 0$. Now using (36)-(40), we can represent the total
noise power at the kth DCT channel as

$$P_f = m_f^2 + \sigma_f^2 = \frac{N\,(N_s(k)+1)}{2\sin^2(2\omega_k)}\,\sigma_R^2. \tag{41}$$

As we can see, given the system wordlength B, the rounding error grows linearly with the block
size N. This indicates that we will have a 3 dB degradation in the SNR as N doubles; however, such
degradation can be compensated by adding 1/2 bit (on average) to the wordlength. On the other
hand, the noise power is inversely proportional to $\sin^2(2\omega_k)$. That is, the effect of the rounding error
in each channel of the IIR DCT depends strongly on the pole locations of the IIR transfer function.
The closer $2\omega_k$ is to 0 or $\pi$, the larger the rounding error is. As a consequence, the 1st and (N - 1)th
DCT channels suffer most from the finite-wordlength effect, while the middle channels have good
SNR in terms of rounding error. This phenomenon is quite different from what we have seen in other
DCT algorithms (cf. Fig. 7 in [13]).
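The channel dependence of the rounding noise can be tabulated with a short sketch. It evaluates the noise power of the kth channel relative to the single-multiplier rounding variance, i.e., N (N_s(k) + 1) / (2 sin^2(2 omega_k)), using omega_k = k pi / (2N) as in the DCT parameter setting. The function names and the floating-point tolerance used to detect a loop multiplier of magnitude exactly one are assumptions of this sketch.

```python
import math

def n_s(k, N):
    """Noise-source count of the loop multiplier 2cos(2 omega_k)."""
    m2 = abs(2 * math.cos(k * math.pi / N))   # 2 omega_k = k pi / N
    if math.isclose(m2, 1.0, abs_tol=1e-12):
        return 0
    return 4 if m2 > 1 else 1

def noise_gain(k, N):
    """Output noise power of channel k divided by the rounding variance."""
    return N * (n_s(k, N) + 1) / (2 * math.sin(k * math.pi / N) ** 2)

def squared_response_sum(k, N):
    """Sum of sin^2[(n+1) 2 omega_k] over one block; equals N/2 for 0 < k < N."""
    return sum(math.sin((n + 1) * k * math.pi / N) ** 2 for n in range(N))

if __name__ == "__main__":
    for k in range(1, 8):
        print(k, round(noise_gain(k, 8), 2))
```

For N = 8 the gain is largest for the channels nearest k = 1 and k = N - 1 and smallest at the middle channel, in line with the pole-location argument above.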

4.2.2  Dynamic  Range 

In fixed-point arithmetic, the input sequence x(t) is represented as a fraction and is bounded by
|x(t)| < 1. Hence, the dynamic range of the circled nodes in Fig. 6 can be computed as

$$D_1 = 2, \tag{42}$$

$$D_2 = \max\{X_{DCT,k}(t)\} = \max\left\{C(k)\sum_{n=0}^{N-1}\cos\left[(2n+1)\omega_k\right]x(n)\right\}$$
$$\quad = C(k)\sum_{n=0}^{N-1}\left|\cos\left[(2n+1)\omega_k\right]\right|\cdot\max\{x(n)\} = C(k)\sum_{n=0}^{N-1}\left|\cos\left[(2n+1)\omega_k\right]\right|, \tag{43}$$

and the dynamic range of the overall architecture is given by

$$D = \max\{D_1, D_2\}. \tag{44}$$

Suppose that a one-time scaling scheme is provided at the input end to avoid overflow, and that it is done
by shifting the data to the right by K bits. We have

$$K = \lceil \log_2 D \rceil, \tag{45}$$

and the scaling factor s is given by

$$s = 2^{-K}. \tag{46}$$
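A small sketch of the scaling computation follows: it evaluates the two node bounds, takes their maximum, and derives the shift count K and the scaling factor s = 2^(-K). The DCT scaling factor C(k) is not defined in this section, so it is passed in as `c_k`; the value sqrt(2/N) = 0.5 used for N = 8 in the usage note is only an assumed example, and the function name is hypothetical.

```python
import math

def input_scaling(k, N, c_k):
    """Dynamic range D, right-shift count K and scaling factor s for channel k."""
    w = k * math.pi / (2 * N)          # omega_k of the DCT setting
    d1 = 2.0                           # bound at the first circled node
    d2 = c_k * sum(abs(math.cos((2 * n + 1) * w)) for n in range(N))
    d = max(d1, d2)
    K = math.ceil(math.log2(d))        # smallest shift with 2^K >= D
    return d, K, 2.0 ** (-K)
```

With the assumed `c_k = 0.5`, `input_scaling(1, 8, 0.5)` gives D of about 2.56, hence K = 2 and s = 0.25; by (31) this costs a factor s^2 = 1/16 in SNR.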


4.2.3  Optimal  Wordlength  Assignment 

Assume that the input sequence x(t) is uniformly distributed over (-1, 1) with zero mean. From
(31), (41), and (46), we have

$$SNR' = s^2\,\frac{E\{(X_{DCT,k}(t))^2\}}{P_f} = \frac{8\sin^2(2\omega_k)}{N\,(N_s(k)+1)}\,2^{2B-2K}, \tag{47}$$

where the fact that [13]

$$E\{(X_{DCT,k}(t))^2\} = E\{x^2(t)\} = 1/3, \qquad k = 1, 2, \ldots, N-1, \tag{48}$$

is used. If we want to achieve a performance of 40 dB in SNR for the kth DCT component, the
optimal wordlength $B_k$ for that channel can be computed from (47) as

$$B_k = \frac{4 - \log_{10}\left[\dfrac{8\sin^2(2\omega_k)}{N\,(N_s(k)+1)}\right]}{2\cdot\log_{10}2} + K. \tag{49}$$


As an example, the $B_k$'s for the cases N = 8 and 16 under the constraint SNR = 40 dB are listed
in Table 4(a), where $B_a$ denotes the average system wordlength. As we can see, $B_a$ = 12 bits is
sufficient to meet the accuracy criteria. Compared with the DCT implementation in [14], in which
$B_a$ was chosen to be 16 bits based on experimental simulation results, our system wordlength is
much shorter. Suppose that the silicon area of the multipliers is dominant in the chip and that the size of
the multipliers is proportional to $(B_a)^2$. Using the optimal wordlengths in Table 4, we can reduce the
total chip area to 56% of the original design, since $(12/16)^2 \approx 0.56$, without degrading the SNR performance. This shows
that our analytical approach provides more insight for determining the architectural specifications than
the experimental approach. Moreover, in transform coding applications, we can shorten the
wordlengths for the high-frequency channels since the human visual system is less sensitive to these
components. Thus, the total wordlength can be further reduced.
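The wordlength rule can be sketched as a pair of functions: one evaluates the scaled SNR for a given wordlength, the other inverts it for a target SNR. The SNR expression used is 8 sin^2(2 omega_k) 2^(2B - 2K) / (N (N_s(k) + 1)), with omega_k = k pi / (2N); the target is generalized from 40 dB to an arbitrary `snr_db`. The result is returned as a real number without rounding up to an integer bit count, and the argument values in the usage note are illustrative assumptions, not entries of Table 4.

```python
import math

def channel_gain(k, N, n_s):
    """SNR factor that does not depend on the wordlength."""
    w = k * math.pi / (2 * N)
    return 8 * math.sin(2 * w) ** 2 / (N * (n_s + 1))

def snr_scaled(B, K, k, N, n_s):
    """Scaled SNR (linear, not dB) for wordlength B and shift count K."""
    return channel_gain(k, N, n_s) * 2.0 ** (2 * B - 2 * K)

def wordlength(snr_db, K, k, N, n_s):
    """Smallest real-valued B that reaches snr_db for channel k."""
    target = snr_db / 10               # log10 of the linear SNR target
    return (target - math.log10(channel_gain(k, N, n_s))) / (2 * math.log10(2)) + K

# Area ratio when multiplier area scales with the square of the wordlength.
area_ratio = (12 / 16) ** 2            # 0.5625
```

For instance, `wordlength(40, 2, 1, 8, 4)` evaluates to roughly 11.2 under the assumed K = 2 and N_s = 4, and substituting that value back into `snr_scaled` returns 10^4, i.e., 40 dB.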


4.3  HR  DCT  Using  Direct  Form  II  Structure 

Given  the  HR  DCT  transfer  function,  we  can  also  implement  it  using  the  direct  form  II  structure 
as  shown  in  Fig.  7.  Following  the  above  derivations  for  the  direct  form  I  structure,  the  fixed-point 
analytical  results  can  be  derived  as: 

1.  Rounding error:

$$P_f = (N_s(k) + 1)\,\frac{2^{-2B}}{3}. \qquad (50)$$

2.  The dynamic range:

$$D_1 = \frac{1}{|\sin(2\omega_k)|} \sum_{n=0}^{N-1} |\sin[(2n+1)\omega_k]|,$$

$$D_2 = 2,$$

$$D = \max\{D_1, D_2\}. \qquad (51)$$


In contrast to the direct form I structure, the dynamic range of the direct form II structure is
affected by the factor $1/|\sin(2\omega_k)|$ in $D_1$; that is, we will have non-uniform dynamic ranges for different
DCT channels. This feature is not desirable in real implementations even though the SNR results
of both structures are comparable to each other (see simulation results in Section 4.5). It not only
requires a different scaling scheme in each DCT channel, but also complicates the data interface between
VLSI modules (e.g., a 2-D DCT in which two DCT modules are connected). Therefore,
the direct form I is a better choice for the VLSI implementation of the HR DCT structures.


4.4  Analysis  for  the  Low-Power  HR  DCT  with  M  =  2 

In  the  low-power  HR  DCT  architecture  with  M  =  2,  the  injected  rounding  error  can  be  modeled  as 
(see Fig. 8)

$$e(t) = e_1(t) + e_2(t) + e_3(t) \qquad (52)$$

and its power is given by

$$\sigma_e^2 = E\{e^2(t)\} = (2 + N_s(k))\,\sigma_R^2. \qquad (53)$$


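Note that

$$H_{ef}(z) = \frac{1}{1 - 2\cos(4\omega_k)\,z^{-1} + z^{-2}}, \qquad (54)$$

and the total number of iterations is reduced to N/2. Thus, the total power of the rounding error at the output
becomes

$$\sigma_f^2 = \sigma_e^2 \sum_{n=0}^{N/2-1} |h_{ef}(n)|^2 = \frac{\sigma_e^2}{\sin^2(4\omega_k)}\left(\frac{N}{4}\right) = \frac{(2 + N_s(k))\,N\,\sigma_R^2}{4\sin^2(4\omega_k)}. \qquad (55)$$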

From  (55),  we  observe  that 


1.  Although  the  total  number  of  noise  sources  increases,  the  total  noise  power  is  compensated  by 
the  halved  number  of  iterations. 


2.  Compared with the corresponding factor in (41), the factor $1/\sin^2(4\omega_k)$ in (55) will have a similar effect
on the SNR of each DCT channel but with halved period.
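
The noise gain in (55) can be checked numerically. The sketch below runs the all-pole recursion corresponding to $H_{ef}(z)$ in (54) for N/2 iterations and compares the accumulated energy of its unit-sample response with the closed form $N/(4\sin^2(4\omega_k))$. It assumes $\omega_k = k\pi/(2N)$ (the DCT setting in Table 3); the function names are ours. The channel k = N/2, where $\sin(4\omega_k) = 0$, is skipped because the closed form is singular there.

```python
import math


def noise_gain_m2(N, k):
    """Energy of the first N/2 samples of the unit-sample response of
    1 / (1 - 2 cos(4 w_k) z^-1 + z^-2), with w_k = k*pi/(2N) (assumed)."""
    theta = 4.0 * k * math.pi / (2.0 * N)
    h_prev2, h_prev1 = 0.0, 0.0
    energy = 0.0
    for n in range(N // 2):
        x = 1.0 if n == 0 else 0.0            # unit-sample input
        h = x + 2.0 * math.cos(theta) * h_prev1 - h_prev2
        energy += h * h
        h_prev2, h_prev1 = h_prev1, h
    return energy


def closed_form_gain_m2(N, k):
    """Closed-form counterpart: N / (4 sin^2(4 w_k))."""
    theta = 4.0 * k * math.pi / (2.0 * N)
    return N / (4.0 * math.sin(theta) ** 2)


N = 16
for k in range(1, N):
    if k == N // 2:
        continue                               # sin(4 w_k) = 0: closed form is singular
    assert math.isclose(noise_gain_m2(N, k), closed_form_gain_m2(N, k), rel_tol=1e-9)
```

The agreement holds because the N/2 iterations cover a whole number of periods of the squared unit-sample response, so the accumulated energy depends on k only through $\sin^2(4\omega_k)$.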
Now let us consider the dynamic range of the low-power DCT structure with M = 2. Given the
assumption that the input sequence x(t) is an i.i.d. sequence, the decimated inputs $x_e(t)$ and $x_o(t)$
are also i.i.d. sequences and are uncorrelated with each other. Thus, we can apply the technique
of “superposition” to analyze the dynamic range of the system: we first set $x_o(t)$ to zero while
analyzing the dynamic range contributed by $x_e(t)$; then we perform the same analysis for $x_o(t)$ by
setting $x_e(t)$ to zero. The overall D can be found from the summation of the two dynamic ranges,
which is given by (see Appendix)

$$D_1 = 2C(k)\,(|\cos\omega_k| + |\cos 3\omega_k|),$$

$$D_2 = C(k) \sum_{n=0}^{N/2-1} \left(|\cos[(4n+1)\omega_k]| + |\cos[(4n+3)\omega_k]|\right),$$

$$D = \max\{D_1, D_2\}. \qquad (56)$$


Using the analytical results in (55) and (56), we can also find the optimal wordlengths for N = 8 and
16 under the 40 dB SNR constraint. The results are listed in Table 4(b). It is interesting to note that
the average wordlengths of the multirate DCT architectures are even shorter than those of the normal
DCT architectures. This is due to the fact that the number of iterations in the HR loop is
reduced to N/M. As M increases, the accumulation of the rounding errors becomes smaller and
thus shorter wordlengths can be allocated. This indicates that the multirate DCT architecture not
only reduces the power consumption, but also has better numerical properties as M increases.
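
The average wordlengths quoted in Table 4 can be reproduced directly from the per-channel entries. The sketch below does so for N = 16; the variable and function names are ours, and the values are copied from Table 4(a) and 4(b).

```python
# Per-channel wordlengths for N = 16, channels k = 1..15, from Table 4.
normal_hr_dct = [13, 12, 12, 11, 11, 10, 10, 10, 10, 10, 11, 11, 12, 12, 13]
multirate_m2 = [12, 11, 10, 10, 10, 11, 12, 12, 12, 11, 10, 10, 10, 11, 12]


def average_wordlength(wordlengths):
    """Arithmetic mean of the per-channel wordlengths, in bits."""
    return sum(wordlengths) / len(wordlengths)


avg_normal = average_wordlength(normal_hr_dct)   # 168 / 15 = 11.2
avg_m2 = average_wordlength(multirate_m2)        # 164 / 15, about 10.9

print(f"normal HR DCT: {avg_normal:.1f} bits")
print(f"M = 2:         {avg_m2:.1f} bits")
```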

The above analyses can be extended to the low-power DCT design with decimation factor equal
to M ($M > 2$, $M$ a power of two). The results are given by

$$P_f = (M + N_s(k)) \left(\frac{N}{2^{m+1}}\right) \frac{\sigma_R^2}{\sin^2(2^{m+1}\omega_k)}, \qquad (57)$$

and

$$D_1 = M \cdot C(k) \sum_{n=0}^{M-1} |\cos[(2n+1)\omega_k]|,$$

$$D_2 = C(k) \sum_{n=0}^{N/M-1} \sum_{i=0}^{M-1} |\cos[(2^{m+1}n + 2i + 1)\omega_k]|,$$

$$D = \max\{D_1, D_2\}, \qquad (58)$$

with $m = \log_2 M$.
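
A minimal sketch of the dynamic-range expressions, assuming the reading of (58) in which $D_1 = M\,C(k)\sum_{n=0}^{M-1}|\cos[(2n+1)\omega_k]|$ and $D_2 = C(k)\sum_{n=0}^{N/M-1}\sum_{i=0}^{M-1}|\cos[(2^{m+1}n+2i+1)\omega_k]|$, with $\omega_k = k\pi/(2N)$. The function names and the default scaling `c_k = 1.0` (a placeholder for C(k)) are ours. For M = 2 the expressions should reduce to (56), which the final loop checks.

```python
import math


def dynamic_range(N, k, M, c_k=1.0):
    """Return (D1, D2, D) for decimation factor M (a power of two).

    c_k stands in for the scaling factor C(k); w_k = k*pi/(2N) is assumed.
    """
    w_k = k * math.pi / (2.0 * N)
    d1 = M * c_k * sum(abs(math.cos((2 * n + 1) * w_k)) for n in range(M))
    d2 = c_k * sum(
        abs(math.cos((2 * M * n + 2 * i + 1) * w_k))   # 2**(m+1) equals 2*M
        for n in range(N // M)
        for i in range(M)
    )
    return d1, d2, max(d1, d2)


def dynamic_range_m2(N, k, c_k=1.0):
    """Explicit M = 2 expressions, as in (56)."""
    w_k = k * math.pi / (2.0 * N)
    d1 = 2.0 * c_k * (abs(math.cos(w_k)) + abs(math.cos(3 * w_k)))
    d2 = c_k * sum(
        abs(math.cos((4 * n + 1) * w_k)) + abs(math.cos((4 * n + 3) * w_k))
        for n in range(N // 2)
    )
    return d1, d2, max(d1, d2)


# The general form with M = 2 should agree with the explicit one.
for k in range(1, 16):
    general = dynamic_range(16, k, 2)
    explicit = dynamic_range_m2(16, k)
    assert all(math.isclose(g, e, rel_tol=1e-12) for g, e in zip(general, explicit))
```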

4.5  Simulation  Results 

To verify our analytical results, computer simulations are carried out using the aforementioned
DCT architectures. The input sequence is a random sequence with uniform probability distribution
over the interval (-1, 1). All the results are based on the average of 1000 independent DCT
computations. Fig. 10 shows the average SNR as a function of the DCT channel number k. As we can see,
there is a close agreement between the theoretical and experimental results. Basically, the SNR
distribution is affected by the factor $\sin^2(2^{m+1}\omega_k)$ in (57) so that its period varies with the decimation
factor M. It should be noted that although Fig. 10 (a) and (b) yield similar SNR results, the uniform
dynamic range of the direct form I structure makes it a better choice for VLSI implementations.
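
The measurement procedure can be sketched as follows. This is not a model of the HR DCT data path; it only illustrates how an SNR figure is averaged over independent trials, using a hypothetical rounding quantizer `quantize` applied directly to a uniform input on (-1, 1). In this simplified setting the expected SNR is about 6.02 B dB, since the signal power is 1/3 and the rounding-noise power is $2^{-2B}/3$ when the quantization step is assumed to be $2^{-(B-1)}$.

```python
import math
import random


def quantize(x, bits):
    """Round x to a grid with step 2**-(bits - 1) (hypothetical B-bit format,
    no saturation handling)."""
    step = 2.0 ** -(bits - 1)
    return round(x / step) * step


def average_snr_db(bits, block_len=16, trials=1000, seed=0):
    """Average SNR (dB) of rounding a uniform(-1, 1) block, over many trials."""
    rng = random.Random(seed)
    total_db = 0.0
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(block_len)]
        signal = sum(v * v for v in x)
        noise = sum((v - quantize(v, bits)) ** 2 for v in x)
        total_db += 10.0 * math.log10(signal / noise)
    return total_db / trials


print(f"B = 12: {average_snr_db(12):.1f} dB")
```

Replacing the quantizer call with a fixed-point model of the architecture under test would give the per-channel curves of the kind shown in Fig. 10.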

Fig. 11 shows the relationship between the average SNR and the wordlength for N = 16.
Compared to the simulation results in [13], the three HR DCT architectures give SNR
performance comparable to that of the DCT architectures by Hou [15] and Lee [16] under fixed-point arithmetic. It is
worth noting that the multirate DCT architectures have better SNR results than the normal HR DCT
architectures; i.e., the multirate DCT has better numerical properties under fixed-point arithmetic,
which is consistent with what we have seen in Table 4.

In summary, the analytical results presented in this section can be used as a good index for future
applications as N and/or M changes. Furthermore, we can assign the optimal wordlength for each
individual DCT channel given the SNR criterion, which is not the case in the fast-algorithm-based
PIPO DCT structures [15] [16]. Due to the global interconnections in the PIPO
DCT structure, each operator at each stage affects part or all of the outputs. Therefore, it is not
easy to find the optimal wordlength for each channel in the PIPO structure.

5  Conclusions 

In this paper, we presented some new aspects of the multirate low-power design discussed in the
companion paper [1]; namely, the logarithmic-complexity low-power architecture, the unified low-power
HR module design, and the optimal wordlength assignment. We have shown that the logarithmic architecture
is a good choice for VLSI implementation when both low power dissipation and chip area are taken
into consideration. The unified HR module presented in Section 3 allows us to perform various
sinusoidal transforms using the same dedicated VLSI architecture. The real-time operation as well
as the programmability of this design make it a promising candidate to be incorporated into the
design of video co-processors. Finally, the finite-wordlength analysis gives us a tool to achieve a
desired SNR with minimal wordlength. It not only reduces the total switching events (hence
the power dissipation), but also provides good control over the total chip area under the SNR
constraint. The material presented in this paper, together with the multirate approach in the
companion paper, constitutes a framework for algorithm-based low-power design with application
to transform coding kernel design.


Appendix 

Derivation  of  (56) 

Setting $x_o(t)$ to zero, Fig. 8 is reduced to the HR structure depicted in Fig. 9, where $w_i(t)$, i = 1, 2,
are the nodes that may have overflow. It is easy to see that

$$D_{1,e} = \max\{w_1(t)\} = C(k)\,(|\cos\omega_k| + |\cos 3\omega_k|). \qquad (59)$$


From the transfer function of $w_2(t)$

$$H_2(z) = \frac{W_2(z)}{X_e(z)} = C(k)\,\frac{\cos 3\omega_k - \cos\omega_k\,z^{-1}}{1 - 2\cos(4\omega_k)\,z^{-1} + z^{-2}}, \qquad (60)$$

we can derive the unit-sample response as

$$h_2(n) = C(k)\cos[(4n+1)\omega_k]\,u(n). \qquad (61)$$


Thus,

$$D_{2,e} = \max\{w_2(t)\} = C(k) \sum_{n=0}^{N/2-1} |\cos[(4n+1)\omega_k]| \cdot \max\{x(n)\} = C(k) \sum_{n=0}^{N/2-1} |\cos[(4n+1)\omega_k]|. \qquad (62)$$
Similarly, by setting $x_e(t) = 0$, we can derive the dynamic ranges of the two circled nodes, $D_{1,o}$ and
$D_{2,o}$, as

$$D_{1,o} = C(k)\,(|\cos\omega_k| + |\cos 3\omega_k|),$$

$$D_{2,o} = C(k) \sum_{n=0}^{N/2-1} |\cos[(4n+3)\omega_k]|. \qquad (63)$$

Combining (62) and (63) together, we can write the overall dynamic range of the multirate DCT as

$$D_1 = D_{1,e} + D_{1,o} = 2C(k)\,(|\cos\omega_k| + |\cos 3\omega_k|),$$

$$D_2 = D_{2,e} + D_{2,o} = C(k) \sum_{n=0}^{N/2-1} \left(|\cos[(4n+1)\omega_k]| + |\cos[(4n+3)\omega_k]|\right),$$

$$D = \max\{D_1, D_2\}. \qquad (64)$$
| | Normal DCT architecture in [2] | Logarithmic low-power DCT architecture | Full low-power DCT architecture in [1] |
|---|---|---|---|
| Multipliers | 2N - 2 | (log M + 2)N (in order) | (M + 1)N (in order) |
| Adders | 2N | (2 log M + 1)N (in order) | (M + 1)N (in order) |
| Power consumption for 16-point DCT | $P_0$ | 0.24 $P_0$ (M = 4) | 0.11 $P_0$ (M = 4) |

Table  1:  Comparison  of  hardware  cost  and  power  consumption  of  the  logarithmic  low-power  DCT 
architecture  with  other  approaches. 


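| | Normal operation: Multiplier | Normal operation: Adder | Downsampling by 2: Multiplier | Downsampling by 2: Adder | Downsampling by 4: Multiplier | Downsampling by 4: Adder |
|---|---|---|---|---|---|---|
| HR MLT | 5N | 5N | 10N | 11N | 20N | 23N |
| HR ELT | [illegible] | [illegible] | 11N | 12N | 21N | 24N |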

Table  2:  Comparison  of  hardware  cost  for  the  MLT  and  ELT  with  their  low-power  designs  in  terms 
of  2-input  multipliers  and  2-input  adders. 
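| Transform | L | Pi | $\omega_k$ | $\theta_k$ | Combination function |
|---|---|---|---|---|---|
| DCT | N | $C(k)$ | $k\pi/(2N)$ | 0 | $X_{DCT,k}(t) = X_{C,k}(t)$ |
| IDCT | N | $C(1)$ | [illegible] | $-\omega_k$ | $X_{IDCT,k}(t) = X_{C,k}(t) + (C(0) - C(1))\,x(n - N + 1)$ |
| DST-IV in [17] | N | $C(1)$ | [illegible] | 0 | $X_{DST,k}(t) = X_{S,k}(t)$ |
| IDST-IV in [17] | N | $C(1)$ | [illegible] | 0 | $X_{IDST,k}(t) = X_{S,k}(t)$ |
| MLT | 2N | $1/\sqrt{2N}$ | $k\pi/(2N)$ | [illegible] | $X_{MLT,k}(t) = -S(k)[X_{C,k+1}(t) + X_{S,k}(t)]$ |
| ELT | 4N | $1/(2\sqrt{2N})$ | [illegible] | [illegible] | $X_{ELT,k}(t) = -X_{S,k+1}(t) + \sqrt{2}\,X_{C,k}(t) + X_{S,k-1}(t)$ |
| DFT | N | $1/\sqrt{N}$ | $-k\pi/N$ | $-\omega_k$ | $\mathrm{Re}\{X_{DFT,k}(t)\} = X_{C,k}(t)$, $\mathrm{Im}\{X_{DFT,k}(t)\} = X_{S,k}(t)$ |
| DHT | N | $1/\sqrt{N}$ | $-k\pi/N$ | $-\omega_k$ | $X_{DHT,k}(t) = X_{C,k}(t) + X_{S,k}(t)$ |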

Table 3: Parameter settings for the unified low-power HR transformation architecture, where
$\mathrm{Re}\{X_{DFT,k}(t)\}$ and $\mathrm{Im}\{X_{DFT,k}(t)\}$ denote the real part and the imaginary part of the DFT,
respectively.
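| DCT channel k | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | $B_a$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $B_k$ (N = 8) | 12 | 11 | 10 | 9 | 10 | 11 | 12 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 10.7 |
| $B_k$ (N = 16) | 13 | 12 | 12 | 11 | 11 | 10 | 10 | 10 | 10 | 10 | 11 | 11 | 12 | 12 | 13 | 11.2 |

(a)

| DCT channel k | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | $B_a$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $B_k$ (N = 8, M = 2) | 10 | 9 | 10 | 11 | 10 | 9 | 10 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 9.9 |
| $B_k$ (N = 16, M = 2) | 12 | 11 | 10 | 10 | 10 | 11 | 12 | 12 | 12 | 11 | 10 | 10 | 10 | 11 | 12 | 10.9 |

(b)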

Table 4: Optimal wordlength assignment under the constraint SNR = 40 dB, where $B_a$ is the average
wordlength. (a) Normal HR DCT. (b) Low-power DCT with M = 2.
Figure 1: (a) Polyphase representation of $H_{DCT,k}(z)$ in cascade form. (b) Multirate DCT architecture
with logarithmic complexity, where $\omega_k$ and $\Gamma_c(m)$, m = 1, 3, are defined in [1].
Figure 2: HR MLT module design. [Block diagram with input x(t) and outputs $X_{C,k}(t)$ and $X_{S,k}(t)$.]


Figure 3: The time-recursive MLT architecture. [Block diagram: an MLT module array followed by a
combination circuit, with outputs $X_{MLT,0}(t)$ through $X_{MLT,7}(t)$.]
Figure 5: Rounding error in the HR DCT architecture.
Figure 10: Average SNR as a function of DCT channel number under fixed-point arithmetic (N = 16,
B = 12). (a) Normal HR DCT using direct form I structure. (b) Normal HR DCT using direct form
II structure. (c) Low-power DCT with M = 2. (d) Low-power DCT with M = 4.
Figure 11: Average SNR as a function of wordlength under fixed-point arithmetic (N = 16). The
multirate low-power architectures have better SNR as M increases.


